-
Notifications
You must be signed in to change notification settings - Fork 31
Overhaul of the architecture #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
2b0acfc
to
ea75499
Compare
ea75499
to
04fac65
Compare
04fac65
to
a914dab
Compare
kaitaistruct.py
Outdated
return cls.from_file(Path(file), map) | ||
elif isinstance(file, Path): | ||
with file.open("rb") as f: | ||
return cls.from_file(f, map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're losing this "try-catch" logic that we had before, which attempted to close file after failure.
I'm by no mean a Python expert, so I'm not sure if that logic made sense in the first place. Can you explan why have you removed it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're losing this "try-catch" logic that we had before, which attempted to close file after failure.
A context manager (with
) is used. It is a replacement for try
except
and is something like RAII: when the block of a context manager is exitted the resource in its header is __exit__
ed. In fact context managers are present in python for pretty long time, they were introduced in python 2.5 in 2008. Explicit usage of try
except
for the cases where a context manager is needed is considered to be an antipattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that it was tried once, but I had to revert it: 7d0b226 — have you checked if it works now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't check the tests since they are not in this repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case you actually cannot use with
. In from_file
you normally want the file to stay open so that the returned KaitaiStream
can use it, and once the KaitaiStream
is closed, the underlying file is also closed. But if an exception is thrown while creating the KaitaiStream
, you need to closed the file manually, because then from_file
doesn't return a KaitaiStream
object that the user could use to close the file. with
will always close the file at the end of the block, regardless of whether an exception is thrown, which is not the desired behavior here, so the manual try
/except
solution is actually necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are completely right.
9ff92ab
to
d9742df
Compare
6e902e5
to
421d79c
Compare
I have tried to refactor the architecture of the runtime to make it composable. It would require some modifications to KSC.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm coming a bit late to this PR, so maybe there's some context I'm missing. What's the motivation behind adding support for memory-mapped IO? Do you have a particular use case where the stream-based IO operations are a performance bottleneck? Also, as far as I can tell, the runtime uses the memory-mapped file exactly like a stream-based file - does memory-mapped IO offer any actual performance benefits this way?
The structure of the code is unfortunately not very clear to me, even with the explanations in #25 (comment). The code uses a number of ABCs, but with no documentation about why they exist, who is meant to extend/implement them, and what the expected behavior for their abstract methods is. More specifically:
- What's the need for splitting
KaitaiStruct
into a hierarchy of three classes (_KaitaiStruct_
,_KaitaiStruct
,KaitaiStruct
)? Is there any advantage to using the superclasses, which have less functionality thanKaitaiStruct
? Also, it's quite confusing to have three classes with identical names except for the number of underscores. - What's the need for
IKaitaiDownStream
and its implementations? All objects that it wraps (file objects,io.BytesIO
,mmap.mmap
) support normal Python stream operations, so there should be no need for a wrapper interface/ABC. - What's the purpose of
KaitaiParser
? It appears to be completely unimplemented.
Some general points about the code:
- There are some blatant errors in the code that make it impossible to import, let alone run correctly. I know you said that changes in the compiler would be needed to use this new version of the runtime, so I can understand that you haven't run a full test suite over this. But please do some basic testing before you submit this as a PR and ask for changes in the compiler. (Or if the PR is not yet finished, please mark it as WIP somehow.)
- This PR breaks compatibility with Python 3.4 and older, including Python 2.7. (
typing
was added to the stdlib in 3.5,pathlib
was added in 3.4, function type annotation syntax was added in 3.0). I don't know if dropping support for 3.4 and older is acceptable for KS, it would depend on what versions are frequently used in the KS userbase. In any case, I think this should be discussed and implemented separately. - This PR includes type hints in some places, but not all. If we decide to use type hints, they should be applied consistently everywhere, but again I think that should be a separate PR.
- The explanations from Overhaul of the architecture #25 (comment) should be included in the code as comments/docstrings in the relevant places. Future readers of the code shouldn't have to read a PR discussion to understand the architecture.
- Many identifiers use dromedaryCase, which is not the usual Python naming convention.
kaitaistruct.py
Outdated
import itertools | ||
import sys | ||
import struct | ||
from io import open, BytesIO, SEEK_CUR, SEEK_END # noqa | ||
from _io import _IOBase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't import _io._IOBase
(or anything else from _io
), it's an undocumented implementation detail with no guaranteed behavior. The public documented class to use here is io.IOBase
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path("./a.txt").open("rb").__class__.mro() [<class '_io.BufferedReader'>, <class '_io._BufferedIOBase'>, <class '_io._IOBase'>, <class 'object'>]
I don't see io.IOBase
in __mro__
, it is probably not inherited from it. In fact it is io.IOBase
inherited from _io._IOBase
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment below - io.IOBase
, io.RawIOBase
, and io.BufferedIOBase
are ABCs and have the standard stream types registered as virtual subclasses. So even if io.IOBase
doesn't appear in the MRO, isinstance(Path(...).open("rb"), io.IOBase)
will return True
.
kaitaistruct.py
Outdated
class KaitaiStruct(object): | ||
def __init__(self, stream): | ||
self._io = stream | ||
class _KaitaiStruct_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it's quite confusing to have classes _KaitaiStruct_
and _KaitaiStruct
. Also, a single trailing underscore is quite unusual in a Python identifier, especially in a class name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any ideas how to better name it? Underscores are usual convention for marking something as use with care
, so _KaitaiStruct_
is double-use with care
, but I shouldn't prepend a double underscore because it is reserved in pythonk so I have appended it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Underscores are the usual convention for internal things that you're not really supposed to touch as an external user, so it's not a good naming convention for classes that users are meant to subclass...
Why not KaitaiStructBase
for the base class, and something like NonClosingKaitaiStruct
for the other class?
kaitaistruct.py
Outdated
_parser = None # :KaitaiParser | ||
|
||
def _read(self): | ||
self.__class__._parser.parse(self, self._io) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where/when/how is _parser
supposed to be set? Currently it's always None
, so this is guaranteed to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is meant to be set in the code generated by KSC.
|
||
def __enter__(self): | ||
self._shouldExit = not stream.is_entered | ||
if self._shouldExit: | ||
self._io.__enter__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should assume that __enter__
on IO streams never does anything - it would simplify our own __enter__
implementations a lot. I think this is a safe assumption, because you can always use streams with manual close
calls instead of with
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO - close
should be removed in favour of __exit__
and constructing shouldn't mean __enter__
ing in python 4 for some built-in types. Yes, I am mad enough to assumme that there will be python 4 fixing the mistakes of python 3 somewhen. And I sincerely hope it will happen. In fact I have a wishlist of the changes it is impossible to implement without breaking compatibility I wanna see in python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO -
close
should be removed in favour of__exit__
and constructing shouldn't mean__enter__
ing in python 4 for some built-in types.
That's unlikely to happen. The general convention in the stdlib seems to be that you're never forced to use with
(or __enter__
/__exit__
), because there are always other ways to do what with
would do (such as close
on streams).
Yes, I am mad enough to assumme that there will be python 4 fixing the mistakes of python 3 somewhen. And I sincerely hope it will happen. In fact I have a wishlist of the changes it is impossible to implement without breaking compatibility I wanna see in python.
That will never happen. I can't find a quote for this right now, but the core devs have said that Python 4 won't be a big release that breaks everything like Python 3. I don't think that the devs (or the community) are interested in another messy incompatible version switch like Python 2 to 3 was.
def close(self): | ||
self._io.close() | ||
if self.shouldExit: | ||
self._io.__exit__(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment about __enter__
above, I think it's safe to assume that __exit__
on IO streams is equivalent to calling close
.
with KaitaiStream(o) as io: | ||
s = cls(io) | ||
s._read() | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work with specs that use pos
instances? Those are compiled to lazy properties, which read their data from the stream only once they are accessed. AFAICT this would break if the file is closed early like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for these case explicit context management is needed.
with OurStructName(KaitaiStream(Path("./aaa"))) as s:
doNeededStuff(s)
Or maybe I even need to try to construct the stream automatically at the cost of an additional check.
kaitaistruct.py
Outdated
|
||
@property | ||
def is_entered(self): | ||
return isinstance(self._io, _IOBase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost guaranteed to not work, because most built-in IO streams don't really extend _io._IOBase
. You need to use io.IOBase
instead - that will work, because it is an ABC and has all built-in IO stream classes registered as virtual subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path("./a.txt").open("rb").__class__.mro()
[<class '_io.BufferedReader'>, <class '_io._BufferedIOBase'>, <class '_io._IOBase'>, <class 'object'>]
also
BytesIO.mro()
[<class '_io.BytesIO'>, <class '_io._BufferedIOBase'>, <class '_io._IOBase'>, <class 'object'>]
on 3.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I remembered that incorrectly - apparently the streams from _io
do actually extend _io._IOBase
.
In any case, _io
is not a documented module, so there's no guarantee that _io._IOBase
will be used that way in the future. io.IOBase
is documented and does exactly the same thing here, so you should really use that instead.
kaitaistruct.py
Outdated
|
||
|
||
def get_downstream(x: typing.Union[bytes, str, Path], *args, **kwargs) -> IKaitaiDownStream: | ||
return downstreamMapping[type(x)](x, *args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work correctly for multiple reasons:
- On Unix systems, Python accepts both
str
andbytes
as paths, even in Python 3. This is to allow working with paths that have no consistent encoding. This isn't handled correctly here - thebytes
path will instead be treated as the contents of a stream. - The lookup will fail if
type(x)
is a subclass of one of the types listed in the mapping. This is unlikely to be a problem forbytes
andstr
, but is guaranteed to break forPath
, because there are no instances ofPath
itself (only of its subclasses,WindowsPath
orPosixPath
). - There is no way to create an
IKaitaiDownStream
from an existing IO stream, which means there's no way to implementKaitaiStream.from_io
based on this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Unix systems, Python accepts both str and bytes as paths, even in Python 3. This is to allow working with paths that have no consistent encoding.
Thanks. Haven't known about this fact.
This isn't handled correctly here - the bytes path will instead be treated as the contents of a stream.
How are we going to address it? Path
doesn't seem to support bytes
at all, so we can consider it as dropped. Should they be decoded into strings somehow?
The lookup will fail if type(x) is a subclass of one of the types listed in the mapping. This is unlikely to be a problem for bytes and str, but is guaranteed to break for Path, because there are no instances of Path itself (only of its subclasses, WindowsPath or PosixPath).
Path("./a.txt").__class__.mro()
[pathlib.PosixPath,
pathlib.Path,
pathlib.PurePosixPath,
pathlib.PurePath,
object]
Will be fixed.
- There is no way to create an IKaitaiDownStream from an existing IO stream, which means there's no way to implement KaitaiStream.from_io based on this function.
It's by design. If one needs a stream, he must write a wrapper implementing IKaitaiDownStream
.
Mainly:
All of it would require major changes in the code generated by the KSC and the runtime. But they are not part of this PR, let's do it step by step. The plan is to store only the metadata in additional memory and do all the operations on the mapped memory directly. So when reading for example a Of course it is suitable not for all use cases so a user should be able to choose which IO model he needs.
I haven't done any measurements about this in fact.
To support context management properly.
I currently consider this as a draft, it will likely be changed a lot. I have pushed it to just get your feedback about the proposed architecture.
Don't see any problem in fixing it.
It is officially EOL. Typing is added for clarity, this PR adds a lot of complexity, so it also need something for clarity in order not to get mad. I can remove the typing if 2.7 support is still needed. Or we can try to use the stuff like
Yes, I gonna use |
b2c6e36
to
f75052a
Compare
I don't have a lot of experience with To prevent this automatic copying behavior and save memory as you intended, I think you need to wrap the
Do you mean that if a type has only
I assume this would be optional? Because currently opaque types don't require any specific base class. Also, I'm not sure if the constructor signature is correct - as far as I can tell, opaque types only receive an
OK, but what's the reason for separating the parser from the struct? The two are still tightly coupled - each parser can only work with one struct type, and you cannot create a struct object without a parser.
I don't think it's a compatibility thing... the
I don't think this behavior is a good idea. It's different from how every other IO stream in Python works. There's also no good way to use these classes without |
ctor = downstreamMapping.get(t1, None) | ||
if ctor: | ||
downstreamMapping[t] = ctor | ||
return ctor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just isinstance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complexity. Doing isinstance
would mean we would have to walk all the types and check each, this is O(n) and maybe more depending on inheritance impl. I have tried to optimize it by a lookup in a dict that should be O(1).
No, I mean either the top
If we deal with a resource that have to be explicitly managed we must use either context management magics or their surrogates (i.e.
It's a feature.
IDK what to answer now, needs some testing.
No strong reasons, just "composition over inheritance" mindset.
It is a compatibility thing - keeping |
OK, I understand what you mean, but I don't think we can differentiate between these two cases at the type level. A top-level type isn't always the type that manages the IO stream (for example if the type is imported and used by another spec), and the type that manages the IO stream isn't always the top-level type of a spec (you can directly use non-top-level types in your application code, e. g. In general I think this would be a good idea though, as it would prevent users from accidentally
I understand that you want to make people do proper resource management, but removing
If the Python devs wanted to change that, they would have done so with the In any case, Python's IO stream behavior is what it is - streams are fully initialized/opened/working when they are created, |
Yes and no.
Yes. That's why the compiler should inherit everythjng from
IMHO, magics cleaner, they make it explicit that we are dealing with a context manager and that we are doing something needing care. The
The whole point of context managers is to get rid of manual freeing. If we make the type to BTW: should we allow the type to be |
f75052a
to
e5a7740
Compare
e5a7740
to
8ec2864
Compare
8ec2864
to
6b30257
Compare
* Added instructions on how to install the tools needed to build and publish the Python runtime. * Fixed formatting of __version__ variable name (the double underscores were previously recognized as Asciidoc italics syntax). * Added step for deleting old built distributions before building new ones, so that files left over from previous builds can't be uploaded by accident. * Added build of wheel in addition to source distribution to the build step (see kaitai-io/kaitai_struct_python_runtime#25). * Added `twine check` step to perform basic checks on the distributions before uploading. * Changed upload instructions from the deprecated `setup.py upload` command to the now recommended `twine upload`. * Updated instructions for providing credentials when uploading. They now no longer need to be stored in an insecure plain text file. * Updated PyPI URLs. * Changed `git push` command so that only the just created tag is pushed, and not any other tags that might exist locally.
6b30257
to
a584140
Compare
2e4d3be
to
bf4ff4f
Compare
1830da3
to
1f6b231
Compare
1f6b231
to
cf35373
Compare
cf35373
to
41d5bf6
Compare
started using memory-mapping files and added
from_any
method